Abstract

This paper describes the Context Tree Switching technique, a modification of Context Tree 
Weighting for the prediction of binary, stationary, n-Markov sources. By modifying Context 
Tree Weighting's recursive weighting scheme, it is possible to mix over a strictly larger class of
models without increasing the asymptotic time or space complexity of the original algorithm. 
We prove that this generalization preserves the desirable theoretical properties of Context Tree 
Weighting on stationary n-Markov sources, and show empirically that this new technique leads 
to consistent improvements over Context Tree Weighting as measured on the Calgary Corpus. 

1 Introduction 



Context Tree Weighting [Willems et al., 1995] is a well-known, universal lossless compression
algorithm for binary, stationary, n-Markov sources. It provides a striking example of a technique that
works well both in theory and practice. Similar to Prediction by Partial Matching [Cleary et al., 1984],
Context Tree Weighting (CTW) uses a context tree data structure to store statistics about the current
data source. These statistics are recursively combined by weighting, which leads to an elegant
algorithm whose worst-case performance can be characterized by an analytic regret bound that holds for
any finite length data sequence, as well as asymptotically achieving (in expectation) the lower bound
of Rissanen [1984] for the class of binary, stationary n-Markov sources.



This paper explores an alternative recursive weighting procedure for CTW, which weights over a 
strictly larger class of models without increasing the asymptotic time or space complexity of the orig- 
inal algorithm. We call this new procedure the Context Tree Switching (CTS) algorithm, which we 
investigate both theoretically and empirically. 



2 Background 

We begin with some notation and definitions for binary data generating sources. Our binary alphabet
is denoted by $\mathcal{X} := \{0, 1\}$. A binary string $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by
$x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. The empty string is denoted by
$\epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$. If $\mathcal{S}$ is a set of strings and
$r \in \{0, 1\}$, then $\mathcal{S} \times r := \{sr : s \in \mathcal{S}\}$. We will also use $l(s)$ to denote the length of a string $s$.
2.1 Probabilistic Binary Sources

We define a probabilistic data generating source $\rho$ to be a set of functions $\rho_n : \mathcal{X}^n \to [0, 1]$, for
$n \in \mathbb{N}$, satisfying the constraint that $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$ for all $x_{1:n} \in \mathcal{X}^n$, with base
case $\rho_0(\epsilon) = 1$. As the meaning is always clear from the argument to $\rho$, we drop the subscripts on $\rho$
from here onwards. Under this definition, the conditional probability of a symbol $x_n$ given previous
data $x_{<n}$ is defined as $\rho(x_n \mid x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$ if $\rho(x_{<n}) > 0$, with the familiar chain rule
$\rho(x_{1:n}) = \prod_{i=1}^{n} \rho(x_i \mid x_{<i})$ now following.

2.2 Coding and Redundancy 

A source code $c : \mathcal{X}^* \to \mathcal{X}^*$ assigns to each possible data sequence $x_{1:n}$ a binary codeword $c(x_{1:n})$
of length $l_c(x_{1:n})$. The typical goal when constructing a source code is to minimize the lengths of
each codeword while ensuring that the original data sequence $x_{1:n}$ is always recoverable from $c(x_{1:n})$.
Given a data generating source $\mu$, we know from Shannon's Source Coding Theorem that the optimal
(in terms of expected code length) source code $c$ uses codewords of length $-\log_2 \mu(x_{1:n})$ bits for all
$x_{1:n}$. This motivates the notion of the redundancy of a source code $c$ given a sequence $x_{1:n}$, which is
defined as $r_c(x_{1:n}) := l_c(x_{1:n}) + \log_2 \mu(x_{1:n})$. Provided the data generating source is known, near
optimal redundancy can essentially be achieved by using arithmetic encoding [Witten et al., 1987].

More precisely, using $a_\mu$ to denote the source code obtained by arithmetic coding using probabilistic
model $\mu$, the resultant code lengths are known to satisfy

$$l_{a_\mu}(x_{1:n}) < -\log_2 \mu(x_{1:n}) + 2, \tag{1}$$

which implies that $r_{a_\mu}(x_{1:n}) < 2$ for all $x_{1:n}$. Typically however, the true data generating source $\mu$ is
unknown. The data can still be coded using arithmetic encoding with an alternate model $\rho$, however
now we expect to use an extra $\mathbb{E}_\mu[\log_2 \mu(x_{1:n}) / \rho(x_{1:n})]$ bits to code the random sequence $x_{1:n} \sim \mu$.

2.3 Weighting and Switching 

This section describes the two fundamental techniques, weighting and switching, that are the key 
building blocks of Context Tree Weighting and the new Context Tree Switching algorithm. 

2.3.1 Weighting 

Suppose we have a finite set $\mathcal{M} := \{\rho_1, \rho_2, \ldots, \rho_N\}$, for some $N \in \mathbb{N}$, of candidate data generating
sources. Consider now a source coding distribution $\xi$, defined as

$$\xi(x_{1:n}) := \sum_{\rho \in \mathcal{M}} w_0^{\rho}\, \rho(x_{1:n}) \tag{2}$$

formed by weighting each model by a real number $w_0^{\rho} > 0$ such that $\sum_{\rho \in \mathcal{M}} w_0^{\rho} = 1$. Notice that if
some model $\rho^* \in \mathcal{M}$ is a good source coding distribution for a data sequence $x_{1:n}$, then provided $n$ is
sufficiently large, $\xi$ will be a good coding distribution, since

$$-\log_2 \xi(x_{1:n}) = -\log_2 \sum_{\rho \in \mathcal{M}} w_0^{\rho}\, \rho(x_{1:n}) \le -\log_2 w_0^{\rho^*} \rho^*(x_{1:n}) = -\log_2 w_0^{\rho^*} - \log_2 \rho^*(x_{1:n}) \tag{3}$$

holds for all $\rho^* \in \mathcal{M}$. Therefore, we would only need at most an extra $-\log_2 w_0^{\rho^*}$ bits, an amount
independent of $n$, to transmit $x_{1:n}$ using $\xi$ instead of the best model $\rho^*$ in $\mathcal{M}$. An important special
case of this result is when $|\mathcal{M}| = 2$ and $w_0^{\rho_1} = w_0^{\rho_2} = \frac{1}{2}$, when only 1 extra bit is required.
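The special case above can be checked numerically. The following is a minimal sketch, assuming two hypothetical i.i.d. Bernoulli models (the parameters 0.2 and 0.8 and the data sequence are illustrative choices, not taken from the paper); it compares the code length under the uniform mixture with the code length under the better of the two models.

```python
import math

def sequence_prob(theta, bits):
    """Probability of `bits` under an i.i.d. Bernoulli model with Pr(1) = theta."""
    prob = 1.0
    for bit in bits:
        prob *= theta if bit == 1 else 1.0 - theta
    return prob

def mixture_prob(thetas, weights, bits):
    """xi(x_{1:n}) from Equation 2: a weighted sum of the model probabilities."""
    return sum(w * sequence_prob(theta, bits) for theta, w in zip(thetas, weights))

# Hypothetical models and data, chosen only for illustration.
bits = [1, 1, 0, 1, 1, 1, 0, 1]
thetas = [0.2, 0.8]
weights = [0.5, 0.5]

mixture_bits = -math.log2(mixture_prob(thetas, weights, bits))
best_model_bits = min(-math.log2(sequence_prob(theta, bits)) for theta in thetas)
extra_bits = mixture_bits - best_model_bits  # at most -log2(0.5) = 1 by Equation 3
```

Whatever data sequence is used, `extra_bits` should stay between 0 and 1, since the mixture is never better than the best model and never worse than half of it.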

2.3.2 Switching 

While weighting provides an easy way to combine models, as an ensemble method it is somewhat
limited in that it only guarantees performance in terms of the best single model in $\mathcal{M}$. It is easy to
imagine situations where this would be insufficient in practice. Instead, one could consider weighting
over sequences of models chosen from a fixed base class $\mathcal{M}$. Variants of this fundamental idea have
been considered by authors from quite different research communities. Within the data compression
community, there is the Switching Method and the Snake algorithm [Volf and Willems, 1998].
Similar approaches were also considered in the online learning community, in particular the Fixed-Share
[Herbster and Warmuth, 1998] algorithm for tracking the best expert over time. From the machine
learning community, related ideas were investigated in the context of Bayesian model selection,
giving rise to the Switch Distribution [van Erven et al., 2007]. The setup we use draws most heavily on
van Erven et al. [2007], though there appears to be considerable overlap amongst the approaches.

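Definition 1. Given a finite model class $\mathcal{M} = \{\rho_1, \ldots, \rho_N\}$ with $N > 1$, for all $n \in \mathbb{N}$, for all
$x_{1:n} \in \mathcal{X}^n$, the Switch Distribution with respect to $\mathcal{M}$ is defined as

$$\tau_\alpha(x_{1:n}) := \sum_{i_{1:n} \in \mathcal{I}_n(\mathcal{M})} w(i_{1:n}) \prod_{k=1}^{n} \rho_{i_k}(x_k \mid x_{<k}) \tag{4}$$

where the prior over model sequences is recursively defined by

$$w(i_{1:n}) := \begin{cases} 1 & \text{if } i_{1:n} = \epsilon \\ \frac{1}{N} & \text{if } n = 1 \\ w(i_{<n}) \times \left( (1 - \alpha_n)\, \mathbb{I}[i_n = i_{n-1}] + \frac{\alpha_n}{N-1}\, \mathbb{I}[i_n \ne i_{n-1}] \right) & \text{otherwise,} \end{cases}$$

with each switch rate $\alpha_k \in [0, 1]$ for $1 < k \le n$, and $\mathcal{I}_n(\mathcal{M}) := \{1, 2, \ldots, N\}^n$.

Now, using the same argument to bound $-\log_2 \tau_\alpha(x_{1:n})$ as we did with $-\log_2 \xi(x_{1:n})$ in Section
2.3.1, we see that the inequality

$$-\log_2 \tau_\alpha(x_{1:n}) \le -\log_2 w(i_{1:n}) - \log_2 \rho_{i_{1:n}}(x_{1:n}) \tag{5}$$

holds for any sequence of models $i_{1:n} \in \mathcal{I}_n(\mathcal{M})$, with $\rho_{i_{1:n}}(x_{1:n}) := \prod_{k=1}^{n} \rho_{i_k}(x_k \mid x_{<k})$. By itself,
Equation 5 provides little reassurance since the $-\log_2 w(i_{1:n})$ term might be large. However, by
decaying the switch rate over time, a meaningful upper bound on $-\log_2 w(i_{1:n})$ that holds for any
sequence of model indices $i_{1:n} \in \mathcal{I}_n(\mathcal{M})$ can be derived.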

Lemma 1. Given a base model class $\mathcal{M}$ and a decaying switch rate $\alpha_t := \frac{1}{t}$ for $t \in \mathbb{N}$,

$$-\log_2 w(i_{1:n}) \le (m(i_{1:n}) + 1)\,(\log_2 |\mathcal{M}| + \log_2 n),$$

for all $i_{1:n} \in \mathcal{I}_n(\mathcal{M})$, where $m(i_{1:n}) := \sum_{k=2}^{n} \mathbb{I}[i_k \ne i_{k-1}]$ denotes the number of switches in $i_{1:n}$.

Proof. See Appendix B. □
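Lemma 1 can be sanity-checked by exhaustive enumeration for small cases. The sketch below assumes a model class of size 3 and sequence lengths up to 6 (both arbitrary choices made for illustration); `switch_prior`, `num_switches` and `lemma1_gap` are hypothetical helper names, and the prior follows the recursive definition in Definition 1 with $\alpha_t = 1/t$.

```python
import math
from itertools import product

def switch_prior(seq, n_models):
    """w(i_{1:n}) from Definition 1, using the switch rate alpha_t = 1/t."""
    if not seq:
        return 1.0
    weight = 1.0 / n_models
    for t in range(2, len(seq) + 1):
        alpha = 1.0 / t
        stayed = seq[t - 1] == seq[t - 2]   # compares i_t with i_{t-1}
        weight *= (1.0 - alpha) if stayed else alpha / (n_models - 1)
    return weight

def num_switches(seq):
    """m(i_{1:n}): how many adjacent positions use different models."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)

def lemma1_gap(seq, n_models):
    """Bound minus actual cost in bits; Lemma 1 says this is never negative."""
    bound = (num_switches(seq) + 1) * (math.log2(n_models) + math.log2(len(seq)))
    return bound + math.log2(switch_prior(seq, n_models))

# Smallest slack over every model sequence of length 1..6 with three models.
worst_gap = min(lemma1_gap(seq, 3)
                for n in range(1, 7)
                for seq in product(range(3), repeat=n))
```

The smallest gap is attained by sequences that never switch with $n \le 2$, where the bound holds with equality, so `worst_gap` is zero up to floating point error.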

Now by combining Equation 5 with Lemma 1 and taking the minimum over $\mathcal{I}_n(\mathcal{M})$ we get the
following upper bound on $-\log_2 \tau_\alpha(x_{1:n})$.

Theorem 1. Given a base model class $\mathcal{M}$ and switch rate $\alpha_t := \frac{1}{t}$ for $t \in \mathbb{N}$, for all $n \in \mathbb{N}$,

$$-\log_2 \tau_\alpha(x_{1:n}) \le \min_{i_{1:n} \in \mathcal{I}_n(\mathcal{M})} \left\{ (m(i_{1:n}) + 1)\,[\log_2 |\mathcal{M}| + \log_2 n] - \log_2 \rho_{i_{1:n}}(x_{1:n}) \right\}.$$

Thus if there exists a good coding distribution $\rho_{i_{1:n}}$ such that $m(i_{1:n}) \ll n$ then $\tau_\alpha$ will also be
a good coding distribution. Additionally, it is natural to compare the performance of switching to
weighting in the case where the best performing sequence of models satisfies $m(i_{1:n}) = 0$. Here an
extra cost of $O(\log n)$ bits is incurred, which is a small price to pay for a significantly larger class of
models.
Algorithm A direct computation of Equation 4 is intractable. For example, given a data sequence $x_{1:n}$
and a model class $\mathcal{M}$, the sum in Equation 4 would require $|\mathcal{M}|^n$ additions. Fortunately, the structured
nature of the model sequence weights can be exploited to derive Algorithm 1, whose proof of
correctness can be found in Appendix A. Assuming that every conditional probability can be computed
in constant time, Algorithm 1 runs in $O(n|\mathcal{M}|)$ time and uses only $O(|\mathcal{M}|)$ space. Furthermore, only
$O(|\mathcal{M}|)$ work is required to process each new symbol.



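Algorithm 1 Switch Distribution - $\tau_\alpha(x_{1:n})$

Require: A finite model class $\mathcal{M} = \{\rho_1, \ldots, \rho_N\}$ such that $N > 1$
Require: A weight vector $(w_1, \ldots, w_N) \in \mathbb{R}^N$, with $w_i = \frac{1}{N}$ for $1 \le i \le N$
Require: A sequence of switching rates $\{\alpha_2, \alpha_3, \ldots, \alpha_n\}$

$r \leftarrow 1$
for $i = 1$ to $n$ do
    $r \leftarrow \sum_{j=1}^{N} w_j\, \rho_j(x_i \mid x_{<i})$
    $k \leftarrow (1 - \alpha_{i+1}) N - 1$
    for $j = 1$ to $N$ do
        $w_j \leftarrow \frac{1}{N-1} \left[ \alpha_{i+1}\, r + k\, w_j\, \rho_j(x_i \mid x_{<i}) \right]$
    end for
end for
return $r$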
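As a sanity check on Algorithm 1, the following sketch implements it in Python alongside a brute-force evaluation of Equation 4. The input format is an assumption made for illustration: `cond_probs[t][j]` holds the conditional probability that model `j` assigns to the symbol actually observed at step `t`, and `alpha` is a hypothetical callable returning the switch rate for a given time index. The numbers in `example_probs` are arbitrary.

```python
from itertools import product

def switch_distribution(cond_probs, alpha):
    """Algorithm 1: returns tau_alpha(x_{1:n}) using O(N) work per symbol."""
    n_models = len(cond_probs[0])
    weights = [1.0 / n_models] * n_models
    r = 1.0
    for i, probs in enumerate(cond_probs, start=1):
        r = sum(w * p for w, p in zip(weights, probs))
        rate = alpha(i + 1)
        k = (1.0 - rate) * n_models - 1.0
        weights = [(rate * r + k * w * p) / (n_models - 1)
                   for w, p in zip(weights, probs)]
    return r

def switch_prior(seq, n_models, alpha):
    """w(i_{1:n}) from Definition 1."""
    if not seq:
        return 1.0
    weight = 1.0 / n_models
    for t in range(2, len(seq) + 1):
        rate = alpha(t)
        stayed = seq[t - 1] == seq[t - 2]
        weight *= (1.0 - rate) if stayed else rate / (n_models - 1)
    return weight

def brute_force(cond_probs, alpha):
    """Equation 4 evaluated directly: a sum over all N^n model sequences."""
    n, n_models = len(cond_probs), len(cond_probs[0])
    total = 0.0
    for seq in product(range(n_models), repeat=n):
        term = switch_prior(seq, n_models, alpha)
        for t, j in enumerate(seq):
            term *= cond_probs[t][j]
        total += term
    return total

# Arbitrary conditional probabilities for 3 models over 5 symbols.
example_probs = [[0.5, 0.9, 0.1], [0.4, 0.8, 0.3], [0.7, 0.2, 0.6],
                 [0.5, 0.1, 0.9], [0.3, 0.3, 0.8]]
decaying_rate = lambda t: 1.0 / t
fast = switch_distribution(example_probs, decaying_rate)
slow = brute_force(example_probs, decaying_rate)
```

Both routines should agree up to floating point error. The brute-force version is exponential in the sequence length and is only usable for tiny inputs, which is the cost that Algorithm 1 avoids.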



Discussion The above switching technique can be used in a variety of ways. For example, drawing
inspiration from Volf and Willems [1998], multiple probabilistic models (such as PPM and CTW)
could be combined with this technique, with the conditional probability $\tau_\alpha(x_n \mid x_{<n})$ of each symbol
$x_n$ given by the ratio $\tau_\alpha(x_{1:n}) / \tau_\alpha(x_{<n})$. This seems to be a direct improvement over the Switching
Method [Volf and Willems, 1998], since similar theoretical guarantees are obtained, while additionally
reducing the time and space required to process each new symbol $x_n$ from $O(n)$ to $O(|\mathcal{M}|)$. This,
however, is not the focus of our paper. Rather, the improved computational properties of Algorithm 1
motivated us to investigate whether the Switch Distribution can be used as a replacement for the
recursive weighting operation inside CTW. It is worth pointing out that the idea of using a switching
method recursively inside a context tree had been discussed before in Appendix A of [Volf, 2002]. This
discussion focused on some of the challenges that would need to be overcome in order to produce a
"Context Tree Switching" algorithm that would be competitive with CTW. The main contribution of
this paper is to describe an algorithm that achieves these goals both in theory and practice.

2.4 Context Tree Weighting 

As our new Context Tree Switching approach extends Context Tree Weighting, we must first review
some of CTW's technical details. We recommend [Willems et al., 1995, 1997] for more information.



2.4.1 Overview 

Context Tree Weighting is a binary sequence prediction technique that works well both in theory
and practice. It is a variable order Markov modeling technique that works by computing a "double
mixture" over the space of all Prediction Suffix Trees (PSTs) of bounded depth $D \in \mathbb{N}$. This involves
weighting (see Section 2.3.1) over all PST structures, as well as integrating over all possible parameter
values for each PST structure. We now review this process, beginning by describing how an unknown,
memoryless, stationary binary source is handled, before moving on to describe how memory can be
added through the use of a Prediction Suffix Tree, and then finishing by showing how to efficiently
weight over all PST structures.

2.4.2 Memoryless, Stationary, Binary Sources 

Consider a sequence $x_{1:n}$ generated by successive Bernoulli trials. If $a$ and $b$ denote the number of
zeroes and ones in $x_{1:n}$ respectively, and $\theta \in [0, 1] \subset \mathbb{R}$ denotes the probability of observing a 1 on
any given trial, then $\Pr(x_{1:n} \mid \theta) = \theta^b (1 - \theta)^a$. One way to construct a distribution over $x_{1:n}$, in the
case where $\theta$ is unknown, is to weight over the possible values of $\theta$. A good choice of weighting can be
obtained via an objective Bayesian analysis, which suggests using the weighting
$w(\theta) := \mathrm{Beta}(\frac{1}{2}, \frac{1}{2}) = \pi^{-1} \theta^{-1/2} (1 - \theta)^{-1/2}$. The resultant estimator is known as the
Krichevsky-Trofimov (KT) estimator [Krichevsky and Trofimov, 1981]. The KT probability of a binary
data sequence $x_{1:n}$ is defined as

$$\xi_{KT}(x_{1:n}) := \int_0^1 \theta^b (1 - \theta)^a\, w(\theta)\, d\theta, \tag{6}$$

for all $n \in \mathbb{N}$. Furthermore, $\xi_{KT}(x_{1:n})$ can be efficiently computed online using the identities

$$\xi_{KT}(x_n = 0 \mid x_{<n}) = \frac{a + 1/2}{a + b + 1}, \qquad \xi_{KT}(x_n = 1 \mid x_{<n}) = \frac{b + 1/2}{a + b + 1}, \tag{7}$$

where $a$ and $b$ here count the zeroes and ones in $x_{<n}$, in combination with the chain rule
$\xi_{KT}(x_{1:n}) = \xi_{KT}(x_n \mid x_{<n}) \times \xi_{KT}(x_{<n})$.
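The online identities in Equation 7 translate directly into a few lines of code. This is a minimal sketch using exact rational arithmetic; the function name `kt_probability` is a hypothetical choice, and bits are assumed to be given as the integers 0 and 1.

```python
from fractions import Fraction

def kt_probability(bits):
    """KT probability of a bit sequence, built up one symbol at a time."""
    zeros = ones = 0
    prob = Fraction(1)
    for bit in bits:
        denominator = 2 * (zeros + ones + 1)
        if bit == 1:
            # (b + 1/2) / (a + b + 1), written with integer numerator and denominator
            prob *= Fraction(2 * ones + 1, denominator)
            ones += 1
        else:
            # (a + 1/2) / (a + b + 1)
            prob *= Fraction(2 * zeros + 1, denominator)
            zeros += 1
    return prob
```

For example, `kt_probability([0, 0])` is 1/2 times 3/4, which gives 3/8, and the four sequences of length two receive probabilities 3/8, 1/8, 1/8 and 3/8. The result depends only on the counts of zeroes and ones, not on their order.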

Parameter Redundancy The parameter redundancy of the KT estimator can be bounded uniformly.
Restating a result from Willems et al. [1995], one can show that for all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, for
all $\theta \in [0, 1]$,

$$\log_2 \frac{\theta^b (1 - \theta)^a}{\xi_{KT}(x_{1:n})} \le \frac{1}{2} \log_2(n) + 1. \tag{8}$$

This result plays an important role in the analysis of both CTW and CTS. 

2.4.3 Variable-length Markov, Stationary, Binary Sources 

A richer class of data generating sources can be defined if we let the source model use memory.
A finite, variable order, binary Markov model [Begleiter et al., 2004] is one such model. This can
equivalently be described by a binary Prediction Suffix Tree (PST). A PST is formed from two main
components: a structure, which is a binary tree where all the left edges are labeled 1 and all the right
edges are labeled 0; and a set of real-valued parameters within [0, 1], with one parameter for every leaf
node in the PST structure. This is now formalized.

Definition 2. A suffix set $\mathcal{S}$ is a collection of binary strings. $\mathcal{S}$ is said to be proper if no string in $\mathcal{S}$ is
a suffix of any other string in $\mathcal{S}$. $\mathcal{S}$ is complete if every semi-infinite binary string $\cdots x_{n-2} x_{n-1} x_n$ has
a suffix in $\mathcal{S}$. $\mathcal{S}$ is of bounded depth $D \in \mathbb{N}$ if $l(s) \le D$ for all $s \in \mathcal{S}$.

A binary Prediction Suffix Tree structure is uniquely described by a complete and proper suffix set.
For example, the suffix set associated with the PST in Figure 1 is $\mathcal{S} := \{1, 10, 00\}$, with each suffix
$s \in \mathcal{S}$ describing a path from a leaf node to the root.

Definition 3. A PST is a pair $(\mathcal{S}, \Theta_{\mathcal{S}})$, where $\mathcal{S}$ is a suffix set and $\Theta_{\mathcal{S}} := \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$. The
depth of a suffix set $\mathcal{S}$ is defined as $d(\mathcal{S}) := \max_{s \in \mathcal{S}} l(s)$. The context with respect to a suffix set $\mathcal{S}$ of
a binary sequence $x_{1:n} \in \mathcal{X}^n$ is defined as $\phi_{\mathcal{S}}(x_{1:n}) := x_{k:n}$, where $k$ is the unique integer such that
$x_{k:n} \in \mathcal{S}$.

Notice that $\phi_{\mathcal{S}}(x_{1:n})$ may be undefined when $n < d(\mathcal{S})$. To avoid this problem, from here onwards
we adopt the convention that the first $d(\mathcal{S})$ bits of any sequence are held back and coded separately.
By denoting these bits as $x_{-d+1} \cdots x_{-1} x_0$, our previous definition of $\phi_{\mathcal{S}}(x_{1:n})$ is always well defined.
Figure 1: An example prediction suffix tree

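Semantics A PST $(\mathcal{S}, \Theta_{\mathcal{S}})$ maps each binary string $x_{1:n}$ with $n \ge d(\mathcal{S})$ to a parameter value $\theta_{\phi_{\mathcal{S}}(x_{1:n})}$,
with the intended meaning that $\Pr(x_{n+1} = 1 \mid x_{1:n}) = \theta_{\phi_{\mathcal{S}}(x_{1:n})}$. For example, the PST in Figure 1
maps the string 1110 to $\theta_{\phi_{\mathcal{S}}(1110)} = \theta_{10} = 0.3$, which means the next bit after 1110 takes on a value of
1 with probability 0.3, and a value of 0 with probability 0.7. If we let $b_s$ and $a_s$ denote the number of
times a 1 and 0 is seen in context $s$ respectively, this gives

$$\Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}}) := \prod_{s \in \mathcal{S}} \theta_s^{b_s} (1 - \theta_s)^{a_s}. \tag{9}$$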
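The context lookup and Equation 9 can be illustrated with a short sketch. The suffix set {1, 10, 00} and the value $\theta_{10} = 0.3$ come from the example above; the other two parameter values (0.1 and 0.5) are hypothetical, since the figure itself is not reproduced here. Strings of the characters "0" and "1" are assumed as the data representation, with the held-back bits passed separately as `past`.

```python
def context(suffix_set, history):
    """phi_S(history): the unique member of the suffix set that ends `history`."""
    matches = [s for s in suffix_set if history.endswith(s)]
    if len(matches) != 1:
        raise ValueError("history too short, or suffix set not proper and complete")
    return matches[0]

def pst_probability(params, past, bits):
    """Pr(bits | S, Theta_S), multiplying one conditional probability per symbol."""
    prob = 1.0
    history = past
    for bit in bits:
        theta = params[context(params, history)]  # Pr(next bit = 1) in this context
        prob *= theta if bit == "1" else 1.0 - theta
        history += bit
    return prob

# theta_10 = 0.3 is from the text; the values for contexts "1" and "00" are made up.
example_params = {"1": 0.1, "10": 0.3, "00": 0.5}
example_context = context(example_params, "1110")
example_prob = pst_probability(example_params, "1110", "10")
```

Here `example_context` is "10", and `example_prob` is 0.3 times 0.9: a 1 in context 10, followed by a 0 in context 1. Grouping the factors by context gives the product form of Equation 9.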

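Unknown Parameters Given a PST with known structure $\mathcal{S}$ but unknown parameters $\Theta_{\mathcal{S}}$, a good
coding distribution can be obtained by replacing each unknown parameter value $\theta_s \in \Theta_{\mathcal{S}}$ with a KT
estimator. If we let $x^s_{1:n}$ denote the (possibly non-contiguous) subsequence of data $x_{1:n}$ that matches
context $s \in \mathcal{S}$, this gives

$$\Pr(x_{1:n} \mid \mathcal{S}) := \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}). \tag{10}$$

This choice is justified by the analysis of Willems et al. [1995]. If we let

$$\gamma(k) := \begin{cases} k & \text{if } 0 \le k < 1 \\ \frac{1}{2} \log_2(k) + 1 & \text{if } k \ge 1, \end{cases}$$

the parameter redundancy of a PST with known structure $\mathcal{S}$ can be bounded by

$$\log_2 \frac{\Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}})}{\Pr(x_{1:n} \mid \mathcal{S})} \le |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right). \tag{11}$$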



2.4.4 Weighting Over Prediction Suffix Trees 

The Context Tree Weighting algorithm combines the data partitioning properties of a PST, a carefully 
chosen weighting scheme, and the distributive law to efficiently weight over the space of PST structures 
of bounded depth $D \in \mathbb{N}$. We now introduce some notation to make this process explicit.

Definition 4. The set of all complete and proper suffix sets of bounded depth $D$ is denoted by $\mathcal{C}_D$, and
is given by the recurrence

$$\mathcal{C}_D := \begin{cases} \{\{\epsilon\}\} & \text{if } D = 0 \\ \{\{\epsilon\}\} \cup \{\mathcal{S}_1 \times 1 \cup \mathcal{S}_2 \times 0 : \mathcal{S}_1, \mathcal{S}_2 \in \mathcal{C}_{D-1}\} & \text{if } D > 0. \end{cases} \tag{12}$$

Notice that $|\mathcal{C}_D|$ grows roughly double exponentially in $D$. For example, $|\mathcal{C}_0| = 1$, $|\mathcal{C}_1| = 2$, $|\mathcal{C}_2| = 5$,
$|\mathcal{C}_3| = 26$, $|\mathcal{C}_4| = 677$, $|\mathcal{C}_5| = 458330$, which means that some ingenuity is required to weight over
all of $\mathcal{C}_D$ for any reasonably sized $D$. This comes in the form of a weighting scheme that is derived from a
natural prefix coding of the structure of a PST. It works as follows: given a PST structure with depth no
more than $D$, a pre-order traversal of the tree is performed. Each time an internal node is encountered
for the first time, a 1 is written down. Each time a leaf node is encountered, a 0 is written if the depth
of the leaf node is less than $D$, otherwise nothing is written. For example, if $D = 3$, the code for the
model shown in Figure 1 is 10100; if $D = 2$, the code for the same model is 101. We now define the
cost $\Gamma_D(\mathcal{S})$ of a suffix set $\mathcal{S}$ to be the length of its code. One can show that $\sum_{\mathcal{S} \in \mathcal{C}_D} 2^{-\Gamma_D(\mathcal{S})} = 1$; i.e.
the prefix code is complete. Thus we can now define

$$\mathrm{CTW}_D(x_{1:n}) := \sum_{\mathcal{S} \in \mathcal{C}_D} 2^{-\Gamma_D(\mathcal{S})} \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}). \tag{13}$$

Notice also that this choice of weighting imposes an Ockham-like penalty on large PST structures. 
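The recurrence for $\mathcal{C}_D$, the structure code and the completeness claim can all be checked by enumeration for small depths. The sketch below represents a suffix set as a frozenset of strings over "0" and "1", with the empty string standing in for $\epsilon$; the function names are hypothetical.

```python
from fractions import Fraction

def suffix_sets(depth):
    """Enumerate C_D using the recurrence of Definition 4."""
    if depth == 0:
        return [frozenset({""})]
    smaller = suffix_sets(depth - 1)
    combined = [frozenset({s + "1" for s in s1} | {s + "0" for s in s2})
                for s1 in smaller for s2 in smaller]
    return [frozenset({""})] + combined

def structure_code(suffix_set, depth, node=""):
    """Pre-order prefix code of a complete and proper suffix set."""
    if node in suffix_set:
        # Leaf: write 0 unless the leaf already sits at the maximum depth.
        return "0" if len(node) < depth else ""
    # Internal node: write 1, then visit the edge labeled 1 before the edge labeled 0.
    return ("1" + structure_code(suffix_set, depth, "1" + node)
                + structure_code(suffix_set, depth, "0" + node))

def cost(suffix_set, depth):
    """Gamma_D(S): the length of the structure code."""
    return len(structure_code(suffix_set, depth))

example_set = frozenset({"1", "10", "00"})
class_sizes = [len(suffix_sets(d)) for d in range(5)]
kraft_sum = sum(Fraction(1, 2 ** cost(s, 3)) for s in suffix_sets(3))
```

With these definitions `structure_code(example_set, 3)` gives "10100" and `structure_code(example_set, 2)` gives "101", matching the example in the text, while `kraft_sum` equals exactly 1 for depth 3.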

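Recursive Decomposition If we let $\mathcal{K}_D := \{0, 1\}^{\le D}$ denote the set of all possible contexts for class
$\mathcal{C}_D$, $x^c_{1:n}$ denote the subsequence of data $x_{1:n}$ that matches context $c \in \mathcal{K}_D$, and define
$\mathrm{CTW}^{\epsilon}_D(x_{1:n}) := \mathrm{CTW}_D(x_{1:n})$, we can decompose Equation 13 into (see [Willems et al., 1995])

$$\mathrm{CTW}^c_D(x_{1:n}) = \tfrac{1}{2}\, \xi_{KT}(x^c_{1:n}) + \tfrac{1}{2}\, \mathrm{CTW}^{0c}_{D-1}(x_{1:n})\, \mathrm{CTW}^{1c}_{D-1}(x_{1:n}), \tag{14}$$

for $D > 0$. In the base case of a single node (i.e. weighting over $\mathcal{C}_0$) we have $\mathrm{CTW}^c_0(x_{1:n}) = \xi_{KT}(x^c_{1:n})$.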
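To make the decomposition concrete, the sketch below evaluates the recursion of Equation 14 naively and compares it with the explicit sum of Equation 13 over every suffix set. It is not the incremental algorithm: it rescans the data at every node and is only meant for tiny inputs. The held-back bits are passed as `past`, and all function names are hypothetical.

```python
from fractions import Fraction

def kt(seq):
    """KT probability of a string of '0'/'1' characters (Equation 7)."""
    zeros = ones = 0
    prob = Fraction(1)
    for bit in seq:
        count = ones if bit == "1" else zeros
        prob *= Fraction(2 * count + 1, 2 * (zeros + ones + 1))
        if bit == "1":
            ones += 1
        else:
            zeros += 1
    return prob

def matching(bits, past, ctx):
    """Subsequence of `bits` whose preceding symbols end with `ctx`."""
    full = past + bits
    return "".join(full[t] for t in range(len(past), len(full))
                   if full[:t].endswith(ctx))

def ctw_recursive(bits, past, depth, ctx=""):
    """Equation 14, applied top-down from context `ctx`."""
    leaf = kt(matching(bits, past, ctx))
    if depth == 0:
        return leaf
    split = (ctw_recursive(bits, past, depth - 1, "0" + ctx)
             * ctw_recursive(bits, past, depth - 1, "1" + ctx))
    return Fraction(1, 2) * leaf + Fraction(1, 2) * split

def suffix_sets(depth):
    """Enumerate C_D (Definition 4)."""
    if depth == 0:
        return [frozenset({""})]
    smaller = suffix_sets(depth - 1)
    return [frozenset({""})] + [
        frozenset({s + "1" for s in s1} | {s + "0" for s in s2})
        for s1 in smaller for s2 in smaller]

def cost(suffix_set, depth):
    """Gamma_D(S): internal nodes plus leaves shallower than D."""
    return (len(suffix_set) - 1) + sum(1 for s in suffix_set if len(s) < depth)

def ctw_direct(bits, past, depth):
    """Equation 13: explicit weighted sum over every suffix set."""
    total = Fraction(0)
    for suffix_set in suffix_sets(depth):
        term = Fraction(1, 2 ** cost(suffix_set, depth))
        for s in suffix_set:
            term *= kt(matching(bits, past, s))
        total += term
    return total

recursive_value = ctw_recursive("0110100", "10", 2)
direct_value = ctw_direct("0110100", "10", 2)
```

Because exact fractions are used, the two values should be identical rather than merely close.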

Computational Properties The efficiency of CTW derives from Equation 14, since the double mixture
can be maintained incrementally by applying it $D + 1$ times to process each new symbol. Therefore,
using the Context Tree Weighting algorithm, only $O(nD)$ time is required to compute $\mathrm{CTW}_D(x_{1:n})$.
Furthermore, only $O(D)$ work is required to compute $\mathrm{CTW}_D(x_{1:n+1})$ from $\mathrm{CTW}_D(x_{1:n})$.

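Theoretical Properties Using Equation 3, the model redundancy can be bounded by

$$-\log_2 \mathrm{CTW}_D(x_{1:n}) = -\log_2 \left( \sum_{\mathcal{S} \in \mathcal{C}_D} 2^{-\Gamma_D(\mathcal{S})} \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \le \Gamma_D(\mathcal{S}) - \log_2 \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}).$$

This can be combined with the parameter redundancy specified by Equation 11 to give

$$-\log_2 \mathrm{CTW}_D(x_{1:n}) \le \Gamma_D(\mathcal{S}) + |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) - \log_2 \Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}}) \tag{15}$$

for any $\mathcal{S} \in \mathcal{C}_D$. Finally, combining Equation 15 with the coding redundancy bound given in Equation
1 leads to the main theoretical result for CTW.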



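Theorem 2 ([Willems et al., 1995]). For all $n \in \mathbb{N}$, given a data sequence $x_{1:n} \in \mathcal{X}^n$ generated by a
binary PST source $(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} \in \mathcal{C}_D$ and $\Theta_{\mathcal{S}} := \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$, the redundancy of CTW
using context depth $D \in \mathbb{N}$ is upper bounded by $\Gamma_D(\mathcal{S}) + |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) + 2$.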

3 Context Tree Switching 

Context Tree Switching is a natural combination of CTW and switching. To see this, first note that
Equation 14 allows us to interpret CTW as a recursive application of the weighting method of Section
2.3.1. Recalling Theorem 1, we know that a careful application of switching essentially preserves the
good properties of weighting, and may even work better provided some rarely changing sequence of
models predicts the data well. Using a class of PST models, it seems reasonable to suspect that the
best model may change over time; for example, a large PST model might work well given sufficient
data, but before then a smaller model might be more accurate due to its smaller parameter redundancy.
The main insight behind CTS is to weight over all sequences of bounded depth PST structures by
recursively using the efficient switching technique of Section 2.3.2 as a replacement for Equation 14.
This gives, for all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, the following recursion for $D > 0$,

$$\mathrm{CTS}^c_D(x_{1:n}) := \sum_{i_{1:n_c} \in \{0,1\}^{n_c}} w_c(i_{1:n_c}) \prod_{k=1}^{n_c} \left[ \mathbb{I}[i_k = 0]\, \frac{\xi_{KT}([x^c_{1:n}]_{1:k})}{\xi_{KT}([x^c_{1:n}]_{<k})} + \mathbb{I}[i_k = 1]\, \frac{\mathrm{CTS}^{0c}_{D-1}(x_{1:t_c(k)})\, \mathrm{CTS}^{1c}_{D-1}(x_{1:t_c(k)})}{\mathrm{CTS}^{0c}_{D-1}(x_{<t_c(k)})\, \mathrm{CTS}^{1c}_{D-1}(x_{<t_c(k)})} \right], \tag{16}$$

for $c \in \mathcal{K}_D$, where $n_c := l(x^c_{1:n})$ and $t_c(k)$ is the smallest integer such that $l(x^c_{1:t_c(k)}) = k$. In the base
cases we have $\mathrm{CTS}^c_0(x_{1:n}) := \xi_{KT}(x^c_{1:n})$ and $\mathrm{CTS}^c_D(\epsilon) := 1$ for any $D \in \mathbb{N}$, $c \in \mathcal{K}_D$.

We now specify the CTS algorithm, which involves describing how to maintain Equation 16
efficiently at each internal node of the context tree data structure, as well as how to select an appropriate
sequence of switching rates (which defines $w_c(i_{1:n_c})$) for each context. Also, from now onwards, we
will use $\mathrm{CTS}_D(x_{1:n})$ to denote the top-level mixture $\mathrm{CTS}^{\epsilon}_D(x_{1:n})$.

3.1 Algorithm 

CTS repeatedly applies Algorithm 1 to efficiently maintain Equation 16 at each distinct context.
This requires maintaining a context tree, where each node representing context $c$ contains six entries:
$\xi_{KT}(x^c_{1:n})$ and associated $a_c$, $b_c$ counts, $\mathrm{CTS}^c_D(x_{1:n})$ and two weight terms $k_c$ and $s_c$ which we define
later. Initially the context tree data structure is empty. Now, given a new symbol $x_n$, having previously
seen the data sequence $x_{<n}$, the context tree is traversed from root to leaf by following the path defined
by the current context $\phi_D(x_{<n}) := x_{n-1} x_{n-2} \cdots x_{n-D}$. If, during this process, a prefix $c$ of $\phi_D(x_{<n})$
is found to not have a node representing it within the context tree, a new node is created with $k_c := 1/2$,
$s_c := 1/2$, $a_c = 0$, and $b_c = 0$. Next, the symbol $x_n$ is processed, by applying in order, for all nodes
corresponding to contexts $c \in \{\phi_D(x_{<n}), \ldots, \phi_1(x_{<n}), \epsilon\}$, the following update equations

$$\mathrm{CTS}^c_D(x_{1:n}) \leftarrow k_c\, \xi_{KT}(x^c_n \mid x^c_{<n}) + s_c\, z^c_D(x_n \mid x_{<n})$$
$$k_c \leftarrow \alpha^c_{n+1}\, \mathrm{CTS}^c_D(x_{1:n}) + (1 - 2\alpha^c_{n+1})\, k_c\, \xi_{KT}(x^c_n \mid x^c_{<n})$$
$$s_c \leftarrow \alpha^c_{n+1}\, \mathrm{CTS}^c_D(x_{1:n}) + (1 - 2\alpha^c_{n+1})\, s_c\, z^c_D(x_n \mid x_{<n}),$$

for $D > 0$, where $\xi_{KT}(x^c_n \mid x^c_{<n}) := \xi_{KT}(x^c_{1:n}) / \xi_{KT}(x^c_{<n})$ and

$$z^c_D(x_n \mid x_{<n}) := \left[ \mathrm{CTS}^{0c}_{D-1}(x_{1:n}) / \mathrm{CTS}^{0c}_{D-1}(x_{<n}) \right] \left[ \mathrm{CTS}^{1c}_{D-1}(x_{1:n}) / \mathrm{CTS}^{1c}_{D-1}(x_{<n}) \right],$$

proceeding from the leaf node back to the root. The first update is line 3 of Algorithm 1 specialized to
two models, and the remaining two are its weight update with $N = 2$. In the base case we have
$\mathrm{CTS}^c_0(x_{1:n}) := \xi_{KT}(x^c_{1:n})$. In addition, for each relevant context, $\xi_{KT}(x^c_{1:n})$ is updated by applying
Equation 7 and incrementing either $a_c$ or $b_c$ by 1. As CTS is identical to CTW except for its constant time
recursive updating scheme, its asymptotic time and space complexity is the same as for CTW.
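The node updates described above can be sketched as follows. This is an illustrative reading of the update equations, not the authors' implementation: the `Node` class, the function names, the use of plain floating point, and the convention that `past` supplies at least `depth` held-back bits are all assumptions. The switch rate is passed in as a hypothetical callable `rate`, defaulting to one over the global time index.

```python
class Node:
    """State kept for one context c in the context tree."""
    def __init__(self):
        self.zeros = 0        # a_c
        self.ones = 0         # b_c
        self.k = 0.5          # weight attached to the KT model
        self.s = 0.5          # weight attached to the split (children) model
        self.prob = 1.0       # CTS^c_D of the data seen so far
        self.children = {}

def update(node, depth, history, bit, rate):
    """Process `bit` at this node; history[0] is the most recent past symbol.
    Returns this node's conditional probability of `bit`."""
    count = node.ones if bit == 1 else node.zeros
    kt_cond = (count + 0.5) / (node.zeros + node.ones + 1)   # Equation 7
    if bit == 1:
        node.ones += 1
    else:
        node.zeros += 1
    previous = node.prob
    if depth == 0:
        node.prob = previous * kt_cond          # base case: plain KT estimator
        return kt_cond
    if history[0] not in node.children:
        node.children[history[0]] = Node()
    # Only the child on the current context path changes, so z is its ratio.
    z = update(node.children[history[0]], depth - 1, history[1:], bit, rate)
    node.prob = node.k * kt_cond + node.s * z
    node.k = rate * node.prob + (1.0 - 2.0 * rate) * node.k * kt_cond
    node.s = rate * node.prob + (1.0 - 2.0 * rate) * node.s * z
    return node.prob / previous

def cts_probability(bits, past, depth, rate=lambda n: 1.0 / n):
    """Top-level CTS probability of `bits`, given the held-back bits `past`."""
    root = Node()
    history = list(reversed(past))
    for n, bit in enumerate(bits, start=1):
        update(root, depth, history, bit, rate(n + 1))
        history.insert(0, bit)
    return root.prob
```

Two properties give a quick check of the sketch. With the switch rate fixed at zero the updates reduce to the CTW recursion of Equation 14, so `cts_probability([1, 1], [0], 1, rate=lambda n: 0.0)` should equal the CTW value 5/16. With the decaying rate, the probabilities of all sequences of a given length should still sum to one.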

Setting the Switching Rate The only part of Equation 16 we have not yet specified is how to set
the switching rate $\alpha^c_n$. With Theorem 1 in mind, our first thought was to use $\alpha^c_n = n_c^{-1}$. However
this choice gave poor empirical performance. Furthermore, with this choice we were unable to find a
redundancy bound competitive with Equation 15. Instead, a much better alternative was to set
$\alpha^c_n = n^{-1}$ for any sub-context. The next result justifies this choice.

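Theorem 3. For all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, for all $D \in \mathbb{N}$, we have

$$-\log_2 \mathrm{CTS}_D(x_{1:n}) \le \Gamma_D(\mathcal{S}) + [d(\mathcal{S}) + 1] \log_2 n + |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) - \log_2 \Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}}), \tag{17}$$

for any pair $(\mathcal{S}, \Theta_{\mathcal{S}})$ where $\mathcal{S} \in \mathcal{C}_D$ and $\Theta_{\mathcal{S}} := \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$.

Proof. See Appendix C. □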

This is a very strong result, since it holds for all binary PST models of maximum depth $D$, and all
possible data sequences, without making any assumptions (probabilistic or otherwise) on how the data
is generated. Additionally, Theorem 3 lets us state a redundancy bound for CTS when it is combined
with an arithmetic encoder to compress data generated by a binary, n-Markov, stationary source.

Corollary 1. For all $n \in \mathbb{N}$, given a data sequence $x_{1:n} \in \mathcal{X}^n$ generated by a binary PST source
$(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} \in \mathcal{C}_D$ and $\Theta_{\mathcal{S}} := \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$, the redundancy of CTS using a context depth
$D \in \mathbb{N}$ is upper bounded by $\Gamma_D(\mathcal{S}) + [d(\mathcal{S}) + 1] \log_2 n + |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) + 2$.
Table 1: Performance (average bits per byte) of CTW and CTS with a fixed D on the Calgary Corpus



Comparing the redundancy bounds in Equation 17 with Equation 15, we see that the CTS bound is
slightly looser, by an additive $[d(\mathcal{S}) + 1] \log_2 n$ term. However this is offset by the fact that CTS
weights over a much larger class than CTW. If the underlying data isn't generated by a single binary
PST source, it seems reasonable to suspect that CTS may perform better than CTW. Notice too that as
$n$ gets large, both methods have $O(\log_2 n)$ redundancy behavior for stationary, $D$-Markov sources.



4 Experimental Results 

We now investigate the performance of Context Tree Switching empirically. For this we measured 
the performance of CTS on the well known Calgary Corpus - a collection of files widely used to 
evaluate compression algorithms. The results (in average bits per byte) are shown in Table 1.

The results for $\mathrm{CTW}_{48}$, $\mathrm{CTS}_{48}$, $\mathrm{CTS}^*_{48}$ and $\mathrm{CTS}^*_{160}$ were generated from our own implementation¹,
which used a standard binary arithmetic encoder to produce the compressed files. The $\mathrm{CTW}_{48}$, $\mathrm{CTS}_{48}$
methods refer to the base CTW and CTS algorithms using a context depth of D=48 (6 bytes). Both
methods used the KT estimator at leaf nodes, and contained no other enhancements. $\mathrm{CTS}^*_{48}$ and
$\mathrm{CTS}^*_{160}$ refer to our enhanced versions of CTS. These used the binary decomposition method from
[Willems and Tjalkens, 1997] and a technique similar to count halving, which multiplied $a_c$ and $b_c$
by a factor of 0.98 during every update. Additionally, $s_c$ and $k_c$ were initialized to 0.925 and 0.075
respectively for each $c \in \mathcal{K}_D$ upon node creation. The remaining $\mathrm{CTW}^*_{48}$ results are from a
state-of-the-art CTW implementation made public by the algorithm's original creators [Willems, 2011]. This version
features important enhancements such as replacing the KT estimator with the Zero-Redundancy
estimator, binary decomposition for byte oriented data, weighting only at byte boundaries and count
halving [Willems and Tjalkens, 1997]. Various combinations of these CTW enhancements were also
tried with CTS, but were found to be slightly inferior to the CTS* method described above.



PPM*    CTW     PPMZ    CTS*    DEPLUMP
2.09    1.99    1.93    1.93    1.89

Table 2: Weighted (by filesize) Average Bits per Byte on the Calgary Corpus

The first two rows in Table 1 compare the performance of the base CTW and CTS algorithms. Here
we see that CTS generally outperforms CTW, in some cases producing files that are 7% smaller. In
the cases where it is worse, it is only by a margin of 1%. The third and fourth rows compare the
performance of the enhanced versions of CTW and CTS. Again we see similar results, with CTS
performing better by up to 8%; in the single case where it is worse, the margin is less than 1%. Finally,
Table 2 shows the performance of CTS (using D=160) relative to the results reported in [Gasthaus et al.].
Here we see that CTS's performance is excellent, comparable with modern PPM techniques
such as PPMZ [Bloom, 1998] and only slightly inferior to the recent Deplump algorithm.

¹ Available at: http://jveness.info/software/cts-v1.zip
5 Conclusion



This paper has introduced Context Tree Switching, a universal algorithm for the compression of 
binary, stationary, n-Markov sources. Experimental results show that the technique gives a small but 
consistent improvement over regular Context Tree Weighting, without sacrificing its theoretical guar- 
antees. We feel our work is interesting since it demonstrates how a well-founded data compression 
algorithm can be constructed from switching. Importantly, this lets us narrow the performance gap
between methods with strong theoretical guarantees and those that work well in practice.

A natural next step would be to investigate whether CTS can be extended for binary, k-Markov,
piecewise stationary sources. This seems possible with some simple modifications to the base algorithm. For
example, the KT estimator could be replaced with a technique that works for unknown, memoryless,
piecewise stationary sources, such as those discussed by Willems [1996] and Willems and Krom [1997].
Theoretically characterizing the redundancy behavior of these combinations, or attempting to derive
a practical algorithm with provable redundancy behavior for k-Markov, piecewise stationary sources
seems an exciting area for future research.
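Appendix A. Correctness of Algorithm 1

This section proves the correctness of Algorithm 1. We begin by first proving a lemma.

Lemma 2. If $w_{j,t}$ denotes the weight $w_j$ at the beginning of iteration $t$ in Algorithm 1, the identity

$$w_{j,t} = \sum_{i_{<t}} w(i_{<t}\, j) \prod_{k=1}^{t-1} \rho_{i_k}(x_k \mid x_{<k})$$

holds for all $t \in \mathbb{N}$.

Proof. We use induction on $t$. In the base case, we have

$$w_{j,1} = \frac{1}{N} = w(j) = \sum_{i_{<1}} w(i_{<1}\, j),$$

which is what is required. Letting $r_t$ denote the value assigned to $r$ on iteration $t$, for the inductive case we have

$$\begin{aligned}
w_{j,t+1} &= \frac{1}{N-1} \left[ \alpha_{t+1}\, r_t + (N(1 - \alpha_{t+1}) - 1)\, w_{j,t}\, \rho_j(x_t \mid x_{<t}) \right] \\
&= \frac{\alpha_{t+1}}{N-1} \sum_{i=1}^{N} w_{i,t}\, \rho_i(x_t \mid x_{<t}) + \frac{N(1 - \alpha_{t+1}) - 1}{N-1}\, w_{j,t}\, \rho_j(x_t \mid x_{<t}) \\
&= \frac{\alpha_{t+1}}{N-1} \sum_{i \ne j} w_{i,t}\, \rho_i(x_t \mid x_{<t}) + (1 - \alpha_{t+1})\, w_{j,t}\, \rho_j(x_t \mid x_{<t}) \\
&= \sum_{i_{1:t} \,|\, i_t \ne j} w(i_{1:t})\, \frac{\alpha_{t+1}}{N-1} \prod_{k=1}^{t} \rho_{i_k}(x_k \mid x_{<k}) + \sum_{i_{1:t} \,|\, i_t = j} w(i_{1:t})\, (1 - \alpha_{t+1}) \prod_{k=1}^{t} \rho_{i_k}(x_k \mid x_{<k}) \\
&= \sum_{i_{1:t}} w(i_{1:t}\, j) \prod_{k=1}^{t} \rho_{i_k}(x_k \mid x_{<k}),
\end{aligned}$$

where the fourth line applies the inductive hypothesis to each $w_{i,t}$ and the last line uses the recursive
definition of the prior in Definition 1. □

Theorem 4. $\forall n \in \mathbb{N}$, $\forall x_{1:n} \in \mathcal{X}^n$, Algorithm 1 computes $\tau_\alpha(x_{1:n})$.

Proof. Letting $w_{j,t}$ denote the weight $w_j$ at the beginning of iteration $t$, Algorithm 1 returns

$$\begin{aligned}
\sum_{j=1}^{N} w_{j,t}\, \rho_j(x_t \mid x_{<t}) &= \sum_{j=1}^{N} \sum_{i_{<t}} w(i_{<t}\, j) \left[ \prod_{k=1}^{t-1} \rho_{i_k}(x_k \mid x_{<k}) \right] \rho_j(x_t \mid x_{<t}) \\
&= \sum_{i_{1:t}} w(i_{1:t}) \prod_{k=1}^{t} \rho_{i_k}(x_k \mid x_{<k}) \\
&= \tau_\alpha(x_{1:t}),
\end{aligned}$$

where the first equality follows from Lemma 2. □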
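Appendix B. Proof of Lemma 1

Lemma 1. Given a base model class $\mathcal{M}$ and a decaying switch rate $\alpha_t := \frac{1}{t}$ for $t \in \mathbb{N}$,

$$-\log_2 w(i_{1:n}) \le (m(i_{1:n}) + 1)\,(\log_2 |\mathcal{M}| + \log_2 n),$$

for all $i_{1:n} \in \mathcal{I}_n(\mathcal{M})$, where $m(i_{1:n}) := \sum_{k=2}^{n} \mathbb{I}[i_k \ne i_{k-1}]$ denotes the number of switches in $i_{1:n}$.

Proof. Consider an arbitrary $i_{1:n} \in \mathcal{I}_n(\mathcal{M})$. Now, letting $m$ denote $m(i_{1:n})$, we have

$$\begin{aligned}
-\log_2 w(i_{1:n}) &= \log_2 |\mathcal{M}| - \log_2 \prod_{t=2}^{n} \left( \frac{\alpha_t}{|\mathcal{M}| - 1}\, \mathbb{I}[i_t \ne i_{t-1}] + (1 - \alpha_t)\, \mathbb{I}[i_t = i_{t-1}] \right) \\
&\le \log_2 |\mathcal{M}| - \log_2 \prod_{t=2}^{n} \left( \frac{1}{n(|\mathcal{M}| - 1)}\, \mathbb{I}[i_t \ne i_{t-1}] + \frac{t-1}{t}\, \mathbb{I}[i_t = i_{t-1}] \right) \\
&\le \log_2 |\mathcal{M}| - \log_2 \left( n^{-m} (|\mathcal{M}| - 1)^{-m} \prod_{t=2}^{n-m} \frac{t-1}{t} \right) \\
&= \log_2 |\mathcal{M}| + m \log_2 n + m \log_2(|\mathcal{M}| - 1) + \log_2(n - m) \\
&\le (m + 1)\,[\log_2 |\mathcal{M}| + \log_2 n].
\end{aligned}$$

□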
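Appendix C. Proof of Theorem 3

Theorem 3. For all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, for all $D \in \mathbb{N}$, we have

$$-\log_2 \mathrm{CTS}_D(x_{1:n}) \le \Gamma_D(\mathcal{S}) + [d(\mathcal{S}) + 1] \log_2 n + |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) - \log_2 \Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}}),$$

for any pair $(\mathcal{S}, \Theta_{\mathcal{S}})$ where $\mathcal{S} \in \mathcal{C}_D$ and $\Theta_{\mathcal{S}} := \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$.

Proof. Consider an arbitrary $\mathcal{S} \in \mathcal{C}_D$ and $\Theta_{\mathcal{S}} = \{\theta_s : \theta_s \in [0, 1]\}_{s \in \mathcal{S}}$. Now define $\bar{\mathcal{S}} \subset \mathcal{K}_D$ as the set of contexts
that index the internal nodes of PST structure $\mathcal{S}$. Observe that, for all $n \in \mathbb{N}$ and for all $x_{1:n} \in \mathcal{X}^n$, by dropping the
sum in Equation 16 we can conclude

$$\mathrm{CTS}^c_D(x_{1:n}) \ge \begin{cases} w_c(1_{1:n_c})\, \mathrm{CTS}^{0c}_{D-1}(x_{1:n})\, \mathrm{CTS}^{1c}_{D-1}(x_{1:n}) & \text{if } c \in \bar{\mathcal{S}} \\ w_c(0_{1:n_c})\, \xi_{KT}(x^c_{1:n}) & \text{if } c \in \mathcal{S} \text{ and } D > 0 \\ \xi_{KT}(x^c_{1:n}) & \text{if } D = 0, \end{cases} \tag{18}$$

for any sub-context $c \in \mathcal{S} \cup \bar{\mathcal{S}}$. Next define $\mathcal{S}' := \{s \in \mathcal{S} : l(s) < D\}$. Now, by repeatedly applying Equation 18,
starting with $\mathrm{CTS}_D(x_{1:n})$ (which recall is defined as $\mathrm{CTS}^{\epsilon}_D(x_{1:n})$) and continuing until no more $\mathrm{CTS}(\cdot)$ terms remain,
we can conclude

$$\begin{aligned}
\mathrm{CTS}_D(x_{1:n}) &\ge \left( \prod_{c \in \bar{\mathcal{S}}} w_c(1_{1:n_c}) \right) \left( \prod_{c \in \mathcal{S}'} w_c(0_{1:n_c}) \right) \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \\
&= \left( \prod_{k=0}^{d(\mathcal{S})} \prod_{c \in \mathcal{S}' \cup \bar{\mathcal{S}} \,:\, l(c) = k} w_c(1_{1:n_c}) \right) \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \\
&\ge 2^{-\Gamma_D(\mathcal{S})} \left( \prod_{k=0}^{d(\mathcal{S})} \prod_{c \in \mathcal{S}' \cup \bar{\mathcal{S}} \,:\, l(c) = k} \prod_{t=2}^{n_c} \frac{t_c(t) - 1}{t_c(t)} \right) \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \\
&\ge 2^{-\Gamma_D(\mathcal{S})} \left( \prod_{k=0}^{d(\mathcal{S})} \prod_{t=2}^{n} \frac{t-1}{t} \right) \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \\
&= 2^{-\Gamma_D(\mathcal{S})}\, n^{-(d(\mathcal{S}) + 1)} \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right).
\end{aligned}$$

The first equality follows by noting that Definition 1 implies $w_c(0_{1:t}) = w_c(1_{1:t})$ for all $t \in \mathbb{N}$ and rearranging. The
second inequality follows from $|\mathcal{S}' \cup \bar{\mathcal{S}}| = \Gamma_D(\mathcal{S})$, $w_c(1) = \frac{1}{2}$ and that either $w_c(1_{1:n_c}) = w_c(\epsilon) = 1$ if $n_c = 0$ or
$w_c(1_{1:n_c}) = \frac{1}{2} \times \prod_{t=2}^{n_c} \frac{t_c(t) - 1}{t_c(t)}$ for $n_c > 0$. The last inequality follows from the observation that the context associated with
each symbol in $x_{1:n}$ matches at most one context $c \in \mathcal{S}' \cup \bar{\mathcal{S}}$ of each specific length $0 \le k \le d(\mathcal{S})$. The final equality
follows upon simplification of the telescoping product. Hence,

$$-\log_2 \mathrm{CTS}_D(x_{1:n}) \le \Gamma_D(\mathcal{S}) + [d(\mathcal{S}) + 1] \log_2 n - \log_2 \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right). \tag{19}$$

Finally, the proof is completed by noting that Equation 11 implies

$$-\log_2 \left( \prod_{s \in \mathcal{S}} \xi_{KT}(x^s_{1:n}) \right) \le |\mathcal{S}|\, \gamma\!\left(\frac{n}{|\mathcal{S}|}\right) - \log_2 \Pr(x_{1:n} \mid \mathcal{S}, \Theta_{\mathcal{S}}),$$

and then combining the above with Equation 19. □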